GWASLab
  • Home
  • Tutorial for gwaslab
  • Sumstats manipulation
    • SumstatsObject
    • StatusCode
    • Standardization&normalization
    • QC&Filtering
    • Harmonization
    • Liftover
    • Data conversion
  • Visualization
    • Manhattan & QQ plot
    • Regional Plot
    • Brisbane Plot
    • Miami Plot
    • Scatter: effect size comparison
    • Scatter: allele frequency comparison
    • Heatmap: genetic correlation
    • Forest Plot
    • Gallery
  • Utilities
    • Extract Lead Variants
    • Extract Novel Variants
    • Format & Save
    • Load LDSC log
    • Convert Heritability
    • Infer Genome Build
  • References
    • Download
    • Reference data
    • Common data
    • Update logs
  • Examples
    • Standardization and QC example
      • Standardization Workflow
      • QC & Filtering
    • Harmonization example
      • Workflow
      • Liftover
    • Formatting and saving
    • Utility example
      • Data conversion
      • Get novel variants
    • Visualization example
      • Manhattan and Q-Q plot
      • Miami plot
      • Regional plot
      • Brisbane plot
    • Reference management
      • Download
  • Previous
  • Next
  • GitHub
  • Tutorial for gwaslab
  • Download sample data
  • Import gwaslab package
  • Loading data into gwaslab Sumstats
  • Create Manhattan and Q-Q plot
  • Standardization & QC : basic_check()
  • Extract lead variants : get_lead()
  • Use the SNPID to create some highly customized mqq plot
  • Quick regional plot without LD-information
  • reference download
  • check available reference
  • download reference using gwaslab
  • Sampling
  • Infer genome build
  • Fix_id
  • Harmonise
  • Formatting and saving : to_format()
  • Liftover

Tutorial for gwaslab¶

  • In this tutorial, we will perform a common pipeline for sumstats QC, standardization and harmonization.
  • We will also show examples of visualization.
  • Using a jupyter notebook, we first download sample data.

Download sample data¶

The data we will use as an example: sumstats of type 2 diabetes from BBJ

In [1]:
Copied!
!wget -O t2d_bbj.txt.gz http://jenger.riken.jp/14/
!wget -O t2d_bbj.txt.gz http://jenger.riken.jp/14/
--2022-10-21 15:41:14--  http://jenger.riken.jp/14/
Resolving jenger.riken.jp (jenger.riken.jp)... 134.160.84.25
Connecting to jenger.riken.jp (jenger.riken.jp)|134.160.84.25|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 274187574 (261M) [text/plain]
Saving to: ‘t2d_bbj.txt.gz’

t2d_bbj.txt.gz      100%[===================>] 261.49M  38.0MB/s    in 7.3s    

2022-10-21 15:41:22 (36.0 MB/s) - ‘t2d_bbj.txt.gz’ saved [274187574/274187574]

Import gwaslab package¶

If you download it from pip, simply run:

In [1]:
Copied!
import gwaslab as gl
import gwaslab as gl

Or if you want to use the latest version from github (beta version) you can clone the repo and import like:

In [1]:
Copied!
import sys
sys.path.insert(0,"/Users/he/work/gwaslab/src")
import gwaslab as gl
import sys sys.path.insert(0,"/Users/he/work/gwaslab/src") import gwaslab as gl

Loading data into gwaslab Sumstats¶

Let's import this raw sumstats into the gwaslab Sumstats Object by specifying the necessary columns.

Note: you can either specify eaf (effect allele frequency) or neaf (non-effect allele frequency), if neaf is specified, it will be converted to eaf when loading sumstats.

In [2]:
Copied!
mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
             snpid="SNP",
             chrom="CHR",
             pos="POS",
             ea="ALT",
             nea="REF",
             neaf="Frq",
             beta="BETA",
             se="SE",
             p="P",
             direction="Dir",
             n="N")
mysumstats = gl.Sumstats("t2d_bbj.txt.gz", snpid="SNP", chrom="CHR", pos="POS", ea="ALT", nea="REF", neaf="Frq", beta="BETA", se="SE", p="P", direction="Dir", n="N")
Tue Dec 13 17:08:19 2022 Start to initiate from file :t2d_bbj.txt.gz
Tue Dec 13 17:08:51 2022  -Reading columns          : ALT,SE,N,Dir,SNP,Frq,BETA,CHR,REF,P,POS
Tue Dec 13 17:08:51 2022  -Renaming columns to      : EA,SE,N,DIRECTION,SNPID,EAF,BETA,CHR,NEA,P,POS
Tue Dec 13 17:08:51 2022  -Current Dataframe shape : 12557761  x  11
Tue Dec 13 17:08:58 2022  -Initiating a status column: STATUS ...
Tue Dec 13 17:09:03 2022  -NEAF is specified...
Tue Dec 13 17:09:03 2022  -Checking if 0<= NEAF <=1 ...
Tue Dec 13 17:09:07 2022  -Converted NEAF to EAF.
Tue Dec 13 17:09:07 2022  -Removed 0 variants with bad NEAF.
Tue Dec 13 17:09:07 2022 Start to reorder the columns...
Tue Dec 13 17:09:07 2022  -Current Dataframe shape : 12557761  x  12
Tue Dec 13 17:09:07 2022  -Reordering columns to    : SNPID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS
Tue Dec 13 17:09:08 2022 Finished sorting columns successfully!
Tue Dec 13 17:09:09 2022 Finished loading data successfully!

Sumstats are stored in Sumstats.data as apandas.DataFrame, you can check the data by:

In [3]:
Copied!
mysumstats.data
mysumstats.data
Out[3]:
SNPID CHR POS EA NEA EAF BETA SE P N DIRECTION STATUS
0 1:725932_G_A 1 725932 G A 0.9960 -0.0737 0.1394 0.5970 166718 -?+- 9999999
1 1:725933_A_G 1 725933 G A 0.0040 0.0737 0.1394 0.5973 166718 +?-+ 9999999
2 1:737801_T_C 1 737801 C T 0.0051 0.0490 0.1231 0.6908 166718 +?-+ 9999999
3 1:749963_T_TAA 1 749963 TAA T 0.8374 0.0213 0.0199 0.2846 166718 -?++ 9999999
4 1:751343_T_A 1 751343 T A 0.8593 0.0172 0.0156 0.2705 166718 -?++ 9999999
... ... ... ... ... ... ... ... ... ... ... ... ...
12557756 X:154874837_A_G X 154874837 G A 0.7478 -0.0064 0.0117 0.5840 191764 -+-+ 9999999
12557757 X:154875192_GTACTC_G X 154875192 GTACTC G 0.2525 0.0071 0.0122 0.5612 191764 +-+- 9999999
12557758 X:154879115_A_G X 154879115 G A 0.7463 -0.0070 0.0122 0.5646 191764 -+-+ 9999999
12557759 X:154880669_T_A X 154880669 T A 0.2558 0.0071 0.0122 0.5618 191764 +-+- 9999999
12557760 X:154880917_C_T X 154880917 C T 0.2558 0.0072 0.0122 0.5570 191764 +-+- 9999999

12557761 rows × 12 columns

In [17]:
Copied!
mysumstats.random_variants(n=10000,inplace=True)
mysumstats.random_variants(n=10000,inplace=True)
Tue Dec 13 19:46:13 2022 Start to randomly select variants from the sumstats...
Tue Dec 13 19:46:13 2022  -Number of variants selected from the sumstats: 10000
Tue Dec 13 19:46:17 2022 Finished sampling...
In [19]:
Copied!
mysumstats.to_format(path="test",fmt="vep")
mysumstats.to_format(path="test",fmt="vep")
Tue Dec 13 19:47:48 2022 Start to format the output sumstats in:  vep  format
Tue Dec 13 19:47:48 2022  -Formatting statistics ...
Tue Dec 13 19:47:48 2022  - Float statistics formats:
Tue Dec 13 19:47:48 2022   - Columns: ['EAF', 'BETA', 'SE', 'P']
Tue Dec 13 19:47:48 2022   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
Tue Dec 13 19:47:48 2022  - Start outputting sumstats in vep format...
Tue Dec 13 19:47:49 2022  -Number of SNPs : 9152
Tue Dec 13 19:47:49 2022  -Number of Insertions : 588
Tue Dec 13 19:47:49 2022  -Number of Deletions : 260
Tue Dec 13 19:47:49 2022  -formatting to 1-based bed-like file (for vep)...
Tue Dec 13 19:47:49 2022  -Output columns: Index(['CHR', 'START', 'END', 'NEA/EA', 'STRAND', 'SNPID'], dtype='object')
Tue Dec 13 19:47:49 2022  -Output path: test.vep.gz
Tue Dec 13 19:47:49 2022  -Saving log file: test.vep.log
Tue Dec 13 19:47:49 2022 Finished outputting successfully!
In [20]:
Copied!
mysumstats.to_format(path="mypath",fmt="vep")
mysumstats.to_format(path="mypath",fmt="vep")
Tue Dec 13 20:04:25 2022 Start to format the output sumstats in:  vep  format
Tue Dec 13 20:04:25 2022  -Formatting statistics ...
Tue Dec 13 20:04:25 2022  - Float statistics formats:
Tue Dec 13 20:04:25 2022   - Columns: ['EAF', 'BETA', 'SE', 'P']
Tue Dec 13 20:04:25 2022   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
Tue Dec 13 20:04:25 2022  - Start outputting sumstats in vep format...
Tue Dec 13 20:04:25 2022  -Number of SNPs : 9152
Tue Dec 13 20:04:25 2022  -Number of Insertions : 588
Tue Dec 13 20:04:25 2022  -Number of Deletions : 260
Tue Dec 13 20:04:25 2022  -formatting to 1-based bed-like file (for vep)...
Tue Dec 13 20:04:25 2022  -Output columns: Index(['CHR', 'START', 'END', 'NEA/EA', 'STRAND', 'SNPID'], dtype='object')
Tue Dec 13 20:04:25 2022  -Output path: mypath.vep.gz
Tue Dec 13 20:04:25 2022  -Saving log file: mypath.vep.log
Tue Dec 13 20:04:25 2022 Finished outputting successfully!

For details on gwaslab Sumstats Object, see: https://cloufield.github.io/gwaslab/SumstatsObject/

Create Manhattan and Q-Q plot¶

Maybe the first thing you want to check is the manhattan and qq plots. gwaslab will do a minimum QC for sumstats when plotting.

For details on Manhattan and Q-Q plot, see: https://cloufield.github.io/gwaslab/Visualization/

In [4]:
Copied!
# skip : skip variants with MLOG10P<2 for faster plotting speed
mysumstats.plot_mqq(skip=2)
# skip : skip variants with MLOG10P<2 for faster plotting speed mysumstats.plot_mqq(skip=2)
Tue Dec 13 17:09:10 2022 Start to plot manhattan/qq plot with the following basic settings:
Tue Dec 13 17:09:10 2022  -Genome-wide significance level is set to 5e-08 ...
Tue Dec 13 17:09:10 2022  -Raw input contains 12557761 variants...
Tue Dec 13 17:09:10 2022  -Plot layout mode is : mqq
Tue Dec 13 17:09:13 2022 Finished loading specified columns from the sumstats.
Tue Dec 13 17:09:13 2022 Start conversion and sanity check:
Tue Dec 13 17:09:14 2022  -Removed 0 variants with nan in CHR or POS column ...
Tue Dec 13 17:09:14 2022  -Removed 0 variants with nan in P column ...
Tue Dec 13 17:09:14 2022  -P values are being converted to -log10(P)...
Tue Dec 13 17:09:14 2022  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Tue Dec 13 17:09:17 2022  -Sanity check: 0 na/inf/-inf variants will be removed...
Tue Dec 13 17:09:19 2022  -Maximum -log10(P) values is 167.58838029403677 .
Tue Dec 13 17:09:19 2022 Finished data conversion and sanity check.
Tue Dec 13 17:09:20 2022 Start to create manhattan plot with 332882 variants:
Tue Dec 13 17:09:22 2022  -Found 89 significant variants with a sliding window size of 500 kb...
Tue Dec 13 17:09:22 2022 Finished creating Manhattan plot successfully!
Tue Dec 13 17:09:22 2022  -Skip annotating
Tue Dec 13 17:09:22 2022 Start to create QQ plot with 332882 variants:
Tue Dec 13 17:09:23 2022  -Calculating lambda GC: 1.2128258399293808
Tue Dec 13 17:09:23 2022 Finished creating QQ plot successfully!
Out[4]:
(<Figure size 3000x1000 with 2 Axes>, <gwaslab.Log.Log at 0x7fa69cfbd580>)

Standardization & QC : basic_check()¶

It is needed to check SNP ID, rsID, chromosome, basepair position, alleles and statistics first before any manipulations (sometimes you need do this before plotting if the sumstats is in a really messy format):

In [5]:
Copied!
#check SNPID,rsID,CHR,POS,EA, NEA and statistics
mysumstats.basic_check()
#check SNPID,rsID,CHR,POS,EA, NEA and statistics mysumstats.basic_check()
Tue Dec 13 17:09:29 2022 Start to check IDs...
Tue Dec 13 17:09:29 2022  -Current Dataframe shape : 12557761  x  12
Tue Dec 13 17:09:29 2022  -Checking if SNPID is chr:pos:ref:alt...(separator: - ,: , _)
Tue Dec 13 17:09:50 2022 Finished checking IDs successfully!
Tue Dec 13 17:09:50 2022 Start to fix chromosome notation...
Tue Dec 13 17:09:50 2022  -Current Dataframe shape : 12557761  x  12
Tue Dec 13 17:10:30 2022  -Vairants with standardized chromosome notation: 12228970
Tue Dec 13 17:11:09 2022  -Vairants with fixable chromosome notations: 328791
Tue Dec 13 17:11:09 2022  -No unrecognized chromosome notations...
Tue Dec 13 17:11:13 2022  -Identified  328791  variants on sex chromosomes...
Tue Dec 13 17:11:13 2022  -Standardizing sex chromosome notations: X Y MT  to 23,24,25...
Tue Dec 13 17:11:44 2022 Finished fixing chromosome notation successfully!
Tue Dec 13 17:11:44 2022 Start to fix basepair positions...
Tue Dec 13 17:11:44 2022  -Current Dataframe shape : 12557761  x  12
Tue Dec 13 17:11:44 2022  -Converting to Int64 data type ...
Tue Dec 13 17:12:10 2022  -Position upper_bound is: 250,000,000
Tue Dec 13 17:12:49 2022  -Remove outliers: 0
Tue Dec 13 17:13:03 2022  -Converted all position to datatype Int64.
Tue Dec 13 17:13:03 2022 Finished fixing basepair position successfully!
Tue Dec 13 17:13:03 2022 Start to fix alleles...
Tue Dec 13 17:13:03 2022  -Current Dataframe shape : 12557761  x  12
Tue Dec 13 17:13:10 2022  -Detected 0 variants with alleles that contain bases other than A/C/T/G .
Tue Dec 13 17:13:13 2022  -Converted all bases to string datatype and UPPERCASE.
Tue Dec 13 17:13:56 2022 Finished fixing allele successfully!
Tue Dec 13 17:13:56 2022 Start sanity check for statistics ...
Tue Dec 13 17:13:56 2022  -Current Dataframe shape : 12557761  x  12
Tue Dec 13 17:14:06 2022  -Checking if  0 <=N<= inf  ...
Tue Dec 13 17:14:17 2022  -Removed 0 variants with bad N.
Tue Dec 13 17:14:17 2022  -Checking if  0 <=EAF<= 1  ...
Tue Dec 13 17:14:22 2022  -Removed 0 variants with bad EAF.
Tue Dec 13 17:14:22 2022  -Checking if  5 <=MAC<= inf  ...
Tue Dec 13 17:14:28 2022  -Removed 0 variants with bad MAC.
Tue Dec 13 17:14:28 2022  -Checking if  5e-300 <= P <= 1  ...
Tue Dec 13 17:14:31 2022  -Removed 0 variants with bad P.
Tue Dec 13 17:14:31 2022  -Checking if  -10 <BETA)< 10  ...
Tue Dec 13 17:14:36 2022  -Removed 0 variants with bad BETA.
Tue Dec 13 17:14:36 2022  -Checking if  0 <SE< inf  ...
Tue Dec 13 17:14:40 2022  -Removed 0 variants with bad SE.
Tue Dec 13 17:14:40 2022  -Checking STATUS...
Tue Dec 13 17:14:43 2022  -Coverting STAUTUS to interger.
Tue Dec 13 17:14:46 2022  -Removed 0 variants with bad statistics in total.
Tue Dec 13 17:14:46 2022 Finished sanity check successfully!
Tue Dec 13 17:14:48 2022 Start to normalize variants...
Tue Dec 13 17:14:48 2022  -Current Dataframe shape : 12557761  x  12
Tue Dec 13 17:15:08 2022  -Not normalized allele IDs:X:7151130_TT_TC X:9093382_CTTTT_CTTT X:12292253_ATTT_ATT X:16001576_ATT_ATTT X:33822416_GT_GTT ... 
Tue Dec 13 17:15:08 2022  -Not normalized allele:['TT' 'TC']['CTTTT' 'CTTT']['ATTT' 'ATT']['ATTT' 'ATT']['GTT' 'GT']... 
Tue Dec 13 17:15:08 2022  -Modified 13 variants according to parsimony and left alignment principal.
Tue Dec 13 17:15:09 2022 Finished normalizing variants successfully!
In [6]:
Copied!
mysumstats.data
mysumstats.data
Out[6]:
SNPID CHR POS EA NEA EAF BETA SE P N DIRECTION STATUS
0 1:725932_G_A 1 725932 G A 0.9960 -0.0737 0.1394 0.5970 166718 -?+- 9960099
1 1:725933_A_G 1 725933 G A 0.0040 0.0737 0.1394 0.5973 166718 +?-+ 9960099
2 1:737801_T_C 1 737801 C T 0.0051 0.0490 0.1231 0.6908 166718 +?-+ 9960099
3 1:749963_T_TAA 1 749963 TAA T 0.8374 0.0213 0.0199 0.2846 166718 -?++ 9960399
4 1:751343_T_A 1 751343 T A 0.8593 0.0172 0.0156 0.2705 166718 -?++ 9960099
... ... ... ... ... ... ... ... ... ... ... ... ...
12557756 X:154874837_A_G 23 154874837 G A 0.7478 -0.0064 0.0117 0.5840 191764 -+-+ 9960099
12557757 X:154875192_GTACTC_G 23 154875192 GTACTC G 0.2525 0.0071 0.0122 0.5612 191764 +-+- 9960399
12557758 X:154879115_A_G 23 154879115 G A 0.7463 -0.0070 0.0122 0.5646 191764 -+-+ 9960099
12557759 X:154880669_T_A 23 154880669 T A 0.2558 0.0071 0.0122 0.5618 191764 +-+- 9960099
12557760 X:154880917_C_T 23 154880917 C T 0.2558 0.0072 0.0122 0.5570 191764 +-+- 9960099

12557761 rows × 12 columns

By checking the log, the sumstats look good. But we found several variants that were not normalizaed and gwaslab fixed this.

.basic_check() is a wrapper of all the following basic functions, you can use these separately.

  • mysumstats.fix_ID()
  • mysumstats.fix_chr()
  • mysumstats.fix_pos()
  • mysumstats.fix_allele()
  • mysumstats.check_sanity()
  • mysumstats.normalize_allele()

For other options, see: https://cloufield.github.io/gwaslab/Standardization/

Extract lead variants : get_lead()¶

Let's extract the lead variants in each significant loci to check our data.

The significant loci are detected based on a sliding window (default window size: windowsizekb=500 kb)

By specifying anno=True , gwaslab will also annotate the lead variant with its nearest gene names and distance.

Note: gwaslab default genome build version is build="19" (GRCh37/hg19), you can change it to build="38" (GRCh38/hg38) when needed.

In [7]:
Copied!
mysumstats.get_lead(anno=True)
mysumstats.get_lead(anno=True)
Tue Dec 13 17:15:10 2022 Start to extract lead variants...
Tue Dec 13 17:15:10 2022  -Processing 12557761 variants...
Tue Dec 13 17:15:10 2022  -Significance threshold : 5e-08
Tue Dec 13 17:15:10 2022  -Sliding window size: 500  kb
Tue Dec 13 17:15:16 2022  -Found 9461 significant variants in total...
Tue Dec 13 17:15:17 2022  -Identified 89 lead variants!
Tue Dec 13 17:15:17 2022 Start to annotate variants with nearest gene name(s)...
Tue Dec 13 17:15:17 2022  -Assigning Gene name using Ensembl Release hg19
Tue Dec 13 17:15:22 2022 Finished annotating variants with nearest gene name(s) successfully!
Tue Dec 13 17:15:22 2022 Finished extracting lead variants successfully!
Out[7]:
SNPID CHR POS EA NEA EAF BETA SE P N DIRECTION STATUS LOCATION GENE
96739 1:22068326_A_G 1 22068326 G A 0.7550 0.0621 0.0103 1.629000e-09 191764 ++++ 9960099 0 USP48
213860 1:51103268_T_C 1 51103268 C T 0.7953 -0.0802 0.0120 2.519000e-11 191764 ---- 9960099 0 FAF1
534095 1:154309595_TA_T 1 154309595 TA T 0.0947 -0.0915 0.0166 3.289000e-08 191764 ---- 9960399 0 ATP8B2
969974 2:640986_CACAT_C 2 640986 C CACAT 0.9006 -0.0946 0.0150 2.665000e-10 191764 ---- 9960399 26349 TMEM18
1091807 2:27734972_G_A 2 27734972 G A 0.5605 0.0691 0.0088 3.897000e-15 191764 ++++ 9960099 0 GCKR
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12272930 X:21569920_A_G 23 21569920 G A 0.3190 0.0423 0.0076 2.616000e-08 191764 ++++ 9960099 0 CNKSR2
12341406 X:48724648_CAA_C 23 48724648 C CAA 0.6260 -0.0602 0.0103 4.576000e-09 191764 ---- 9960399 26082 TIMM17B
12350767 X:57170781_A_AT 23 57170781 AT A 0.3003 -0.0447 0.0076 4.583000e-09 191764 ---- 9960399 -6723 SPIN2A
12469290 X:117915163_T_TA 23 117915163 TA T 0.5560 0.0548 0.0071 9.818000e-15 191764 ++++ 9960399 0 IL13RA1
12554976 X:152908887_G_A 23 152908887 G A 0.6792 -0.1235 0.0077 9.197000e-58 191764 ---- 9960099 0 DUSP9

89 rows × 14 columns

For other options, see: https://cloufield.github.io/gwaslab/ExtractLead/

Use the SNPID to create some highly customized mqq plot¶

Gwaslab can create more complicated manhattan plot.

plot and save as my_first_mqq_plot.png with {"dpi":400,"facecolor":"white"}

In [8]:
Copied!
mysumstats.plot_mqq(mode="mqq",
                  cut=20,skip=3,
                  anno="GENENAME",
                  anno_set=["9:22132729_A_G","6:20688121_T_A","9:22132729_A_G","15:62394264_G_C"] ,
                  pinpoint=["9:22132729_A_G","5:176513896_C_A"], 
                  highlight=["7:127253550_C_T","19:46166604_C_T"],
                  highlight_windowkb =1000,
                  stratified=True,
                  marker_size=(5,10),
                  figargs={"figsize":(15,5),"dpi":300},
                  save="my_first_mqq_plot.png", 
                  saveargs={"dpi":400,"facecolor":"white"})
mysumstats.plot_mqq(mode="mqq", cut=20,skip=3, anno="GENENAME", anno_set=["9:22132729_A_G","6:20688121_T_A","9:22132729_A_G","15:62394264_G_C"] , pinpoint=["9:22132729_A_G","5:176513896_C_A"], highlight=["7:127253550_C_T","19:46166604_C_T"], highlight_windowkb =1000, stratified=True, marker_size=(5,10), figargs={"figsize":(15,5),"dpi":300}, save="my_first_mqq_plot.png", saveargs={"dpi":400,"facecolor":"white"})
Tue Dec 13 17:15:23 2022 Start to plot manhattan/qq plot with the following basic settings:
Tue Dec 13 17:15:23 2022  -Genome-wide significance level is set to 5e-08 ...
Tue Dec 13 17:15:23 2022  -Raw input contains 12557761 variants...
Tue Dec 13 17:15:23 2022  -Plot layout mode is : mqq
Tue Dec 13 17:15:23 2022  -Variants to annotate : 9:22132729_A_G,6:20688121_T_A,9:22132729_A_G,15:62394264_G_C
Tue Dec 13 17:15:23 2022  -Loci to highlight : 7:127253550_C_T,19:46166604_C_T
Tue Dec 13 17:15:23 2022  -Highlight_window is set to:  1000  kb
Tue Dec 13 17:15:23 2022  -Variants to pinpoint : 9:22132729_A_G,5:176513896_C_A
Tue Dec 13 17:15:31 2022 Finished loading specified columns from the sumstats.
Tue Dec 13 17:15:31 2022 Start conversion and sanity check:
Tue Dec 13 17:15:34 2022  -Removed 0 variants with nan in CHR or POS column ...
Tue Dec 13 17:15:36 2022  -Removed 0 variants with nan in EAF column ...
Tue Dec 13 17:15:38 2022  -Removed 0 variants with nan in P column ...
Tue Dec 13 17:15:38 2022  -P values are being converted to -log10(P)...
Tue Dec 13 17:15:38 2022  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Tue Dec 13 17:15:41 2022  -Sanity check: 0 na/inf/-inf variants will be removed...
Tue Dec 13 17:15:45 2022  -Maximum -log10(P) values is 167.58838029403677 .
Tue Dec 13 17:15:45 2022  -Minus log10(P) values above 20 will be shrunk with a shrinkage factor of 10...
Tue Dec 13 17:15:45 2022 Finished data conversion and sanity check.
Tue Dec 13 17:15:45 2022 Start to create manhattan plot with 91234 variants:
Tue Dec 13 17:15:45 2022  -Highlighting target loci...
Tue Dec 13 17:15:45 2022  -Pinpointing target vairants...
Tue Dec 13 17:15:45 2022  -Found 3 specified variants to annotate...
Tue Dec 13 17:15:45 2022 Start to annotate variants with nearest gene name(s)...
Tue Dec 13 17:15:45 2022  -Assigning Gene name using Ensembl Release hg19
Tue Dec 13 17:15:46 2022 Finished annotating variants with nearest gene name(s) successfully!
Tue Dec 13 17:15:46 2022 Finished creating Manhattan plot successfully!
Tue Dec 13 17:15:46 2022  -Annotating using column GENENAME...
Tue Dec 13 17:15:46 2022 Start to create QQ plot with 91234 variants:
Tue Dec 13 17:15:48 2022  -Calculating lambda GC: 1.2128258399293808
Tue Dec 13 17:15:48 2022 Finished creating QQ plot successfully!
Tue Dec 13 17:15:48 2022 Saving plot:
Tue Dec 13 17:15:51 2022  -Saved to my_first_mqq_plot.png successfully!
Out[8]:
(<Figure size 4500x1500 with 2 Axes>, <gwaslab.Log.Log at 0x7fa69cfbd580>)

For details, see: https://cloufield.github.io/gwaslab/Visualization/

Quick regional plot without LD-information¶

Gwaslab can plot regional plot with or with out LD reference files.

For details see: https://cloufield.github.io/gwaslab/RegionalPlot/

In [9]:
Copied!
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True, figargs= {"figsize":(15,5)})
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True, figargs= {"figsize":(15,5)})
Tue Dec 13 17:15:55 2022 Start to plot manhattan/qq plot with the following basic settings:
Tue Dec 13 17:15:55 2022  -Genome-wide significance level is set to 5e-08 ...
Tue Dec 13 17:15:55 2022  -Raw input contains 12557761 variants...
Tue Dec 13 17:15:55 2022  -Plot layout mode is : r
Tue Dec 13 17:15:55 2022  -Region to plot : chr7:156538803-157538803.
Tue Dec 13 17:15:58 2022  -Extract SNPs in region : chr7:156538803-157538803...
Tue Dec 13 17:16:38 2022  -Extract SNPs in specified regions: 5831
Tue Dec 13 17:16:39 2022 Finished loading specified columns from the sumstats.
Tue Dec 13 17:16:39 2022 Start conversion and sanity check:
Tue Dec 13 17:16:39 2022  -Removed 0 variants with nan in CHR or POS column ...
Tue Dec 13 17:16:39 2022  -Removed 0 variants with nan in P column ...
Tue Dec 13 17:16:39 2022  -P values are being converted to -log10(P)...
Tue Dec 13 17:16:39 2022  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Tue Dec 13 17:16:39 2022  -Sanity check: 0 na/inf/-inf variants will be removed...
Tue Dec 13 17:16:39 2022  -Maximum -log10(P) values is 7.948076083953893 .
Tue Dec 13 17:16:39 2022 Finished data conversion and sanity check.
Tue Dec 13 17:16:39 2022 Start to create manhattan plot with 5831 variants:
Tue Dec 13 17:16:40 2022  -Loading gtf files from:default
Tue Dec 13 17:17:00 2022  -plotting gene track..
Tue Dec 13 17:17:00 2022  -Finished plotting gene track..
Tue Dec 13 17:17:01 2022  -Found 1 significant variants with a sliding window size of 500 kb...
Tue Dec 13 17:17:01 2022 Finished creating Manhattan plot successfully!
Tue Dec 13 17:17:01 2022  -Skip annotating
Out[9]:
(<Figure size 3000x2000 with 3 Axes>, <gwaslab.Log.Log at 0x7fa69cfbd580>)

reference download¶

Full regional plot using a user-provided vcf or preprocessed vcf: (e.g 1000 genome, see Reference: https://cloufield.github.io/gwaslab/Reference/)

check available reference¶

update the available reference list first

In [10]:
Copied!
gl.update_available_ref()
gl.update_available_ref()
Tue Dec 13 17:17:03 2022 Updating available_ref list from: https://raw.github.com/Cloufield/gwaslab/main/src/gwaslab/data/reference.json
Tue Dec 13 17:17:19 2022 Available_ref list has been updated!
In [11]:
Copied!
gl.check_available_ref()
gl.check_available_ref()
Tue Dec 13 17:17:19 2022 Start to check available reference files...
Tue Dec 13 17:17:19 2022  - 1kg_eas_hg19  :  https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1
Tue Dec 13 17:17:19 2022  - 1kg_eas_hg19_tbi  :  https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1
Tue Dec 13 17:17:19 2022  - 1kg_eur_hg19  :  https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1
Tue Dec 13 17:17:19 2022  - 1kg_eur_hg19_tbi  :  https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1
Tue Dec 13 17:17:19 2022  - 1kg_eas_hg38  :  https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1
Tue Dec 13 17:17:19 2022  - 1kg_eas_hg38_tbi  :  https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1
Tue Dec 13 17:17:19 2022  - 1kg_eur_hg38  :  https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1
Tue Dec 13 17:17:19 2022  - 1kg_eur_hg38_tbi  :  https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1
Tue Dec 13 17:17:19 2022  - dbsnp_v151_hg19  :  https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz
Tue Dec 13 17:17:19 2022  - dbsnp_v151_hg38  :  https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz
Tue Dec 13 17:17:19 2022  - ucsc_genome_hg19  :  http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz
Tue Dec 13 17:17:19 2022  - ucsc_genome_hg38  :  https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
Tue Dec 13 17:17:19 2022  - 1kg_dbsnp151_hg19_auto  :  https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1
Tue Dec 13 17:17:19 2022  - 1kg_dbsnp151_hg38_auto  :  https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1
Tue Dec 13 17:17:19 2022  - recombination_hg19  :  https://www.dropbox.com/s/4b39o8m58m5yvqx/recombination_hg19.tar.gz?dl=1
Tue Dec 13 17:17:19 2022  - ensembl_hg19_gtf  :  https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz
Tue Dec 13 17:17:19 2022  - ensembl_hg19_gtf_protein_coding  :  https://www.dropbox.com/s/ovsqf4kyeniakxf/Homo_sapiens.GRCh37.75.protein_coding.gtf.gz?dl=1
Tue Dec 13 17:17:19 2022  - ensembl_hg38_gtf  :  https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.108.chr.gtf.gz
Tue Dec 13 17:17:19 2022  - ensembl_hg38_gtf_protein_coding  :  https://www.dropbox.com/s/37kicx0l6uifa1e/Homo_sapiens.GRCh38.107.protein_coding.chr.gtf.gz?dl=1
Tue Dec 13 17:17:19 2022  - refseq_hg19_gtf  :  https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz
Tue Dec 13 17:17:19 2022  - refseq_hg19_gtf_protein_coding  :  https://www.dropbox.com/s/gpzflnid7zo3kca/GRCh37_latest_genomic.protein_coding.gtf.gz?dl=1
Tue Dec 13 17:17:19 2022  - refseq_hg38_gtf  :  https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz
Tue Dec 13 17:17:19 2022  - refseq_hg38_gtf_protein_coding  :  https://www.dropbox.com/s/reojrsjmbwsu0pp/GRCh38_latest_genomic.protein_coding.gtf.gz?dl=1
Tue Dec 13 17:17:19 2022  - testlink  :  https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1
Tue Dec 13 17:17:19 2022  - testlink_tbi  :  https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1
Out[11]:
{'1kg_eas_hg19': 'https://www.dropbox.com/s/lztaxqhy2o6dpxw/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1',
 '1kg_eas_hg19_tbi': 'https://www.dropbox.com/s/k9klefl8m9fcfo1/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1',
 '1kg_eur_hg19': 'https://www.dropbox.com/s/1nbgqshknevseks/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz?dl=1',
 '1kg_eur_hg19_tbi': 'https://www.dropbox.com/s/vscvkrflh6fc5a0/EUR.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi?dl=1',
 '1kg_eas_hg38': 'https://www.dropbox.com/s/3dstbbb1el9r3au/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1',
 '1kg_eas_hg38_tbi': 'https://www.dropbox.com/s/vwnp5vd8dcqksn4/EAS.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1',
 '1kg_eur_hg38': 'https://www.dropbox.com/s/z0mkehg17lryapv/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz?dl=1',
 '1kg_eur_hg38_tbi': 'https://www.dropbox.com/s/ze8g58x75x9qbf0/EUR.ALL.split_norm_af.1kg_30x.hg38.vcf.gz.tbi?dl=1',
 'dbsnp_v151_hg19': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/00-All.vcf.gz',
 'dbsnp_v151_hg38': 'https://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/00-All.vcf.gz',
 'ucsc_genome_hg19': 'http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.fa.gz',
 'ucsc_genome_hg38': 'https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz',
 '1kg_dbsnp151_hg19_auto': 'https://www.dropbox.com/s/37p2u1xwmux4gwo/1kg_dbsnp151_hg19_auto.txt.gz?dl=1',
 '1kg_dbsnp151_hg38_auto': 'https://www.dropbox.com/s/ouf60n7gdz6cm0g/1kg_dbsnp151_hg38_auto.txt.gz?dl=1',
 'recombination_hg19': 'https://www.dropbox.com/s/4b39o8m58m5yvqx/recombination_hg19.tar.gz?dl=1',
 'ensembl_hg19_gtf': 'https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.chr.gtf.gz',
 'ensembl_hg19_gtf_protein_coding': 'https://www.dropbox.com/s/ovsqf4kyeniakxf/Homo_sapiens.GRCh37.75.protein_coding.gtf.gz?dl=1',
 'ensembl_hg38_gtf': 'https://ftp.ensembl.org/pub/current_gtf/homo_sapiens/Homo_sapiens.GRCh38.108.chr.gtf.gz',
 'ensembl_hg38_gtf_protein_coding': 'https://www.dropbox.com/s/37kicx0l6uifa1e/Homo_sapiens.GRCh38.107.protein_coding.chr.gtf.gz?dl=1',
 'refseq_hg19_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh37_latest/refseq_identifiers/GRCh37_latest_genomic.gtf.gz',
 'refseq_hg19_gtf_protein_coding': 'https://www.dropbox.com/s/gpzflnid7zo3kca/GRCh37_latest_genomic.protein_coding.gtf.gz?dl=1',
 'refseq_hg38_gtf': 'https://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.gff.gz',
 'refseq_hg38_gtf_protein_coding': 'https://www.dropbox.com/s/reojrsjmbwsu0pp/GRCh38_latest_genomic.protein_coding.gtf.gz?dl=1',
 'testlink': 'https://www.dropbox.com/s/8u7capwge0ihshu/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz?dl=1',
 'testlink_tbi': 'https://www.dropbox.com/s/hdneg53t6u1j6ib/EAS.chr22.split_norm_af.1kgp3v5.vcf.gz.tbi?dl=1'}

download reference using gwaslab¶

In [12]:
Copied!
gl.download_ref("1kg_eas_hg19")
gl.download_ref("1kg_eas_hg19")
Tue Dec 13 17:17:19 2022 Start to download  1kg_eas_hg19  ...
Tue Dec 13 17:17:19 2022  -Downloading to: /Users/he/opt/anaconda3/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Tue Dec 13 17:24:21 2022  -Updating record in config file...
Tue Dec 13 17:24:23 2022  -Updating record in config file...
Tue Dec 13 17:24:23 2022  -Downloading to: /Users/he/opt/anaconda3/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz.tbi
Tue Dec 13 17:24:23 2022 Downloaded  1kg_eas_hg19  successfully!

after downloading, use get_path to select the file path

In [13]:
Copied!
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True,vcf_path=gl.get_path("1kg_eas_hg19"))
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True,vcf_path=gl.get_path("1kg_eas_hg19"))
Tue Dec 13 17:24:23 2022 Start to plot manhattan/qq plot with the following basic settings:
Tue Dec 13 17:24:23 2022  -Genome-wide significance level is set to 5e-08 ...
Tue Dec 13 17:24:23 2022  -Raw input contains 12557761 variants...
Tue Dec 13 17:24:23 2022  -Plot layout mode is : r
Tue Dec 13 17:24:23 2022  -Region to plot : chr7:156538803-157538803.
Tue Dec 13 17:24:27 2022  -Extract SNPs in region : chr7:156538803-157538803...
Tue Dec 13 17:24:58 2022  -Extract SNPs in specified regions: 5831
Tue Dec 13 17:24:59 2022 Finished loading specified columns from the sumstats.
Tue Dec 13 17:24:59 2022 Start conversion and sanity check:
Tue Dec 13 17:24:59 2022  -Removed 0 variants with nan in CHR or POS column ...
Tue Dec 13 17:24:59 2022  -Removed 0 variants with nan in P column ...
Tue Dec 13 17:24:59 2022  -P values are being converted to -log10(P)...
Tue Dec 13 17:24:59 2022  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Tue Dec 13 17:24:59 2022  -Sanity check: 0 na/inf/-inf variants will be removed...
Tue Dec 13 17:24:59 2022  -Maximum -log10(P) values is 7.948076083953893 .
Tue Dec 13 17:24:59 2022 Finished data conversion and sanity check.
Tue Dec 13 17:24:59 2022 Start to load reference genotype...
Tue Dec 13 17:24:59 2022  -reference vcf path : /Users/he/opt/anaconda3/lib/python3.8/site-packages/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Tue Dec 13 17:25:01 2022  -Retrieving index...
Tue Dec 13 17:25:01 2022  -Ref variants in the region: 35278
Tue Dec 13 17:25:01 2022  -Calculating Rsq...
Tue Dec 13 17:25:01 2022 Finished loading reference genotype successfully!
Tue Dec 13 17:25:01 2022 Start to create manhattan plot with 5831 variants:
Tue Dec 13 17:25:01 2022  -Loading gtf files from:default
Tue Dec 13 17:25:18 2022  -plotting gene track..
Tue Dec 13 17:25:19 2022  -Finished plotting gene track..
Tue Dec 13 17:25:19 2022  -Found 1 significant variants with a sliding window size of 500 kb...
Tue Dec 13 17:25:19 2022 Finished creating Manhattan plot successfully!
Tue Dec 13 17:25:19 2022  -Skip annotating
Out[13]:
(<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7fa69cfbd580>)

or you can provide your own vcf file. Let's annotate the lead variant this time.

In [4]:
Copied!
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True,anno=True,
                    vcf_path="/Users/he/work/gwaslab/src/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz")
mysumstats.plot_mqq(mode="r",region=(7,156538803,157538803),region_grid=True,anno=True, vcf_path="/Users/he/work/gwaslab/src/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz")
Tue Dec 13 16:37:36 2022 Start to plot manhattan/qq plot with the following basic settings:
Tue Dec 13 16:37:36 2022  -Genome-wide significance level is set to 5e-08 ...
Tue Dec 13 16:37:36 2022  -Raw input contains 12557761 variants...
Tue Dec 13 16:37:36 2022  -Plot layout mode is : r
Tue Dec 13 16:37:36 2022  -Region to plot : chr7:156538803-157538803.
Tue Dec 13 16:37:42 2022  -Extract SNPs in region : chr7:156538803-157538803...
Tue Dec 13 16:38:27 2022  -Extract SNPs in specified regions: 5831
Tue Dec 13 16:38:29 2022 Finished loading specified columns from the sumstats.
Tue Dec 13 16:38:29 2022 Start conversion and sanity check:
Tue Dec 13 16:38:29 2022  -Removed 0 variants with nan in CHR or POS column ...
Tue Dec 13 16:38:29 2022  -Removed 0 variants with nan in P column ...
Tue Dec 13 16:38:29 2022  -P values are being converted to -log10(P)...
Tue Dec 13 16:38:29 2022  -Sanity check after conversion: 0 variants with P value outside of (0,1] will be removed...
Tue Dec 13 16:38:29 2022  -Sanity check: 0 na/inf/-inf variants will be removed...
Tue Dec 13 16:38:29 2022  -Maximum -log10(P) values is 7.948076083953893 .
Tue Dec 13 16:38:29 2022 Finished data conversion and sanity check.
Tue Dec 13 16:38:29 2022 Start to load reference genotype...
Tue Dec 13 16:38:29 2022  -reference vcf path : /Users/he/work/gwaslab/src/gwaslab/data/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Tue Dec 13 16:38:31 2022  -Retrieving index...
Tue Dec 13 16:38:31 2022  -Ref variants in the region: 35278
Tue Dec 13 16:38:32 2022  -Calculating Rsq...
Tue Dec 13 16:38:32 2022 Finished loading reference genotype successfully!
Tue Dec 13 16:38:32 2022 Start to create manhattan plot with 5831 variants:
Tue Dec 13 16:38:32 2022 Start to download  recombination_hg19  ...
Tue Dec 13 16:38:32 2022  -Downloading to: /Users/he/opt/anaconda3/lib/python3.8/site-packages/gwaslab/data/recombination/hg19/recombination_hg19.tar.gz
Tue Dec 13 16:38:41 2022  -Updating record in config file...
Tue Dec 13 16:38:41 2022 Downloaded  recombination_hg19  successfully!
Tue Dec 13 16:38:41 2022  -Loading gtf files from:default
Tue Dec 13 16:38:41 2022 No records in config file. Please download first.
Tue Dec 13 16:38:41 2022 Start to download  ensembl_hg19_gtf  ...
Tue Dec 13 16:38:41 2022  -Downloading to: /Users/he/opt/anaconda3/lib/python3.8/site-packages/gwaslab/data/Homo_sapiens.GRCh37.87.chr.gtf.gz
Tue Dec 13 16:40:46 2022  -Updating record in config file...
Tue Dec 13 16:40:46 2022 Downloaded  ensembl_hg19_gtf  successfully!
Tue Dec 13 16:41:13 2022  -plotting gene track..
Tue Dec 13 16:41:14 2022  -Finished plotting gene track..
Tue Dec 13 16:41:14 2022  -Found 1 significant variants with a sliding window size of 500 kb...
Tue Dec 13 16:41:14 2022 Finished creating Manhattan plot successfully!
Tue Dec 13 16:41:14 2022  -Annotating using column CHR:POS...
Out[4]:
(<Figure size 3000x2000 with 4 Axes>, <gwaslab.Log.Log at 0x7fe1dbeb2580>)

Note: gwaslab default genome build version is build="19" (GRCh37/hg19), you can change it to build="38" (GRCh38/hg38) when needed. For gene tracks, default is gtf_path="ensembl" , you can also use gtf_path="refseq" (NCBA RefSeq)

Sampling¶

There are more than 10 million variants in the original sumstats and it will take long to process the entrie dataset. Since it might take a while to process the entire datasets, let us just random sample 1 million variants for this tutorial.

In [ ]:
Copied!
mysumstats.random_variants(n=100000,inplace=True)
mysumstats.random_variants(n=100000,inplace=True)

Infer genome build¶

In case you don't know the genome build of the sumstats

For details, see: https://cloufield.github.io/gwaslab/InferBuild/

In [ ]:
Copied!
mysumstats.infer_build()
mysumstats.infer_build()

Fix_id¶

You may notice that the SNPID is in CHR:POS_REF_ALT format. We want SNPID to be in a stadardized format chr:pos:ref:alt, we can use fix_id for this:

For other options of standardization, see: https://cloufield.github.io/gwaslab/Standardization/

In [ ]:
Copied!
#fixsep : fix ID separator
mysumstats.fix_id(fixsep=True)
#fixsep : fix ID separator mysumstats.fix_id(fixsep=True)
In [ ]:
Copied!
mysumstats.data
mysumstats.data

Harmonise¶

gwaslab can harmonize the sumstats based on reference files.

  • ref_seq : reference genome fasta file for allele alignment
  • ref_rsid_tsv : tsv file for annotation of common used variants
  • ref_rsid_vcf : vcf file for annotation of other variants
  • ref_infer : vcf file with allele frequency information for inferring strand and comparing allele frequency
  • ref_alt_freq : field in INFO of vcf file for alternative allele frequency

For details see: https://cloufield.github.io/gwaslab/Harmonization/

For reference data, see: https://cloufield.github.io/gwaslab/Reference/

In [17]:
Copied!
mysumstats.harmonize(basic_check=False,
                    n_cores=3,
                    ref_seq="/Users/he/Documents/Mydata/human_g1k_v37.fasta",
                    ref_rsid_tsv="/Users/he/Documents/Mydata/EAS_1kg_af_dbsnp151.ALL.tsv",
                    ref_rsid_vcf="/Users/he/Documents/Mydata/All_20180423.vcf.gz",
                    ref_infer="/Users/he/Documents/Mydata/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz",ref_alt_freq="AF")
mysumstats.harmonize(basic_check=False, n_cores=3, ref_seq="/Users/he/Documents/Mydata/human_g1k_v37.fasta", ref_rsid_tsv="/Users/he/Documents/Mydata/EAS_1kg_af_dbsnp151.ALL.tsv", ref_rsid_vcf="/Users/he/Documents/Mydata/All_20180423.vcf.gz", ref_infer="/Users/he/Documents/Mydata/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz",ref_alt_freq="AF")
Wed Nov  2 13:28:01 2022 Start to check if NEA is aligned with reference sequence...
Wed Nov  2 13:28:01 2022  -Current Dataframe shape : 100000  x  12
Wed Nov  2 13:28:01 2022  -Reference genome fasta file: /Users/he/Documents/Mydata/human_g1k_v37.fasta
Wed Nov  2 13:28:01 2022  -Checking records: 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  X  Y  MT  
Wed Nov  2 13:28:39 2022  -Variants allele on given reference sequence :  41568
Wed Nov  2 13:28:39 2022  -Variants flipped :  50063
Wed Nov  2 13:28:39 2022   -Raw Matching rate :  91.63%
Wed Nov  2 13:28:39 2022  -Variants inferred reverse_complement :  0
Wed Nov  2 13:28:39 2022  -Variants inferred reverse_complement_flipped :  0
Wed Nov  2 13:28:39 2022  -Both allele on genome + unable to distinguish :  8369
Wed Nov  2 13:28:39 2022  -Variants not on given reference sequence :  0
Wed Nov  2 13:28:39 2022  -Current Dataframe shape : 100000  x  12
Wed Nov  2 13:28:40 2022 Start to flip allele-specific stats for SNPs with status xxxxx[35]x: alt->ea , ref->nea ... 
Wed Nov  2 13:28:40 2022  -Flipping 50063 variants...
Wed Nov  2 13:28:40 2022  -Swapping column: NEA <=> EA...
Wed Nov  2 13:28:40 2022  -Flipping column: BETA = - BETA...
Wed Nov  2 13:28:40 2022  -Flipping column: EAF = 1 - EAF...
Wed Nov  2 13:28:40 2022  -Flipping column: DIRECTION +-? <=> -+? ...
Wed Nov  2 13:28:40 2022  -Changed the status for flipped variants : xxxxx[35]x -> xxxxx[12]x
Wed Nov  2 13:28:41 2022 Finished converting successfully!
Wed Nov  2 13:28:41 2022 Start to infer strand for palindromic SNPs...
Wed Nov  2 13:28:41 2022  -Current Dataframe shape : 100000  x  12
Wed Nov  2 13:28:41 2022  -Reference vcf file: /Users/he/Documents/Mydata/eas_1kg_af/EAS.ALL.split_norm_af.1kgp3v5.hg19.vcf.gz
Wed Nov  2 13:28:41 2022  -Alternative allele frequency in INFO: AF
Wed Nov  2 13:28:42 2022  -Identified  14159  palindromic SNPs...
Wed Nov  2 13:28:42 2022  -After filtering by MAF<  0.4  , the strand of  12908  palindromic SNPs will be inferred...
Wed Nov  2 13:29:07 2022   -Non-palindromic :  76762
Wed Nov  2 13:29:07 2022   -Palindromic SNPs on + strand:  12520
Wed Nov  2 13:29:08 2022   -Palindromic SNPs on - strand and need to be flipped: 24
Wed Nov  2 13:29:08 2022   -Palindromic SNPs with maf not availble to infer :  1251
Wed Nov  2 13:29:08 2022   -Palindromic SNPs with no macthes or no information :  175
Wed Nov  2 13:29:09 2022  -Identified  8369  indistinguishable Indels...
Wed Nov  2 13:29:09 2022  -Indistinguishable Indels will be inferred from reference vcf ref and alt...
Wed Nov  2 13:29:36 2022   -Indels ea/nea match reference :  3538
Wed Nov  2 13:29:37 2022   -Indels ea/nea need to be flipped :  4531
Wed Nov  2 13:29:37 2022   -Indels with no macthes or no information :  300
Wed Nov  2 13:29:37 2022  -Current Dataframe shape : 100000  x  12
Wed Nov  2 13:29:38 2022 Start to flip allele-specific stats for standardized indels with status xxxx[123][67][6]: alt->ea , ref->nea ... 
Wed Nov  2 13:29:39 2022  -Flipping 4531 variants...
Wed Nov  2 13:29:39 2022  -Swapping column: NEA <=> EA...
Wed Nov  2 13:29:39 2022  -Flipping column: BETA = - BETA...
Wed Nov  2 13:29:39 2022  -Flipping column: EAF = 1 - EAF...
Wed Nov  2 13:29:39 2022  -Flipping column: DIRECTION +-? <=> -+? ...
Wed Nov  2 13:29:39 2022  -Changed the status for flipped variants xxxx[123][67]6 -> xxxx[123][67]4
Wed Nov  2 13:29:39 2022 Start to flip allele-specific stats for palindromic SNPs with status xxxxx[12]5: (-)strand <=> (+)strand ... 
Wed Nov  2 13:29:40 2022  -Flipping 24 variants...
Wed Nov  2 13:29:40 2022  -Flipping column: BETA = - BETA...
Wed Nov  2 13:29:40 2022  -Flipping column: EAF = 1 - EAF...
Wed Nov  2 13:29:40 2022  -Flipping column: DIRECTION +-? <=> -+? ...
Wed Nov  2 13:29:40 2022  -Changed the status for flipped variants:  xxxxx[012]5: ->  xxxxx[012]2
Wed Nov  2 13:29:40 2022 Finished converting successfully!
Wed Nov  2 13:29:40 2022 Start to annotate rsID based on chromosome and position information...
Wed Nov  2 13:29:40 2022  -Current Dataframe shape : 100000  x  12
Wed Nov  2 13:29:40 2022  -SNPID-rsID text file: /Users/he/Documents/Mydata/EAS_1kg_af_dbsnp151.ALL.tsv
Wed Nov  2 13:29:40 2022  -100000 rsID could be possibly fixed...
Wed Nov  2 13:29:40 2022  -Setting block size:  5000000
Wed Nov  2 13:29:40 2022  -Loading block: 0   1   2   3   4   5   6   7   8   9   10   11   12   13   14   15   16   
Wed Nov  2 13:35:24 2022  -rsID Annotation for 3296 need to be fixed!
Wed Nov  2 13:35:24 2022  -Annotated 96704 rsID successfully!
Wed Nov  2 13:35:24 2022 Start to assign rsID using vcf...
Wed Nov  2 13:35:24 2022  -Current Dataframe shape : 100000  x  13
Wed Nov  2 13:35:24 2022  -CPU Cores to use : 3
Wed Nov  2 13:35:24 2022  -Reference VCF file: /Users/he/Documents/Mydata/All_20180423.vcf.gz
Wed Nov  2 13:35:24 2022  -Assigning rsID based on chr:pos and ref:alt/alt:ref...
Wed Nov  2 13:35:41 2022  -rsID Annotation for 618 need to be fixed!
Wed Nov  2 13:35:41 2022  -Annotated 2678 rsID successfully!
Wed Nov  2 13:35:41 2022 Start to sort the genome coordinates...
Wed Nov  2 13:35:41 2022  -Current Dataframe shape : 100000  x  13
Wed Nov  2 13:35:41 2022  -Force converting POS to integers...
Wed Nov  2 13:35:41 2022  -Sorting genome coordinates...
Wed Nov  2 13:35:42 2022 Finished sorting genome coordinates successfully!
Wed Nov  2 13:35:42 2022 Start to reorder the columns...
Wed Nov  2 13:35:42 2022  -Current Dataframe shape : 100000  x  13
Wed Nov  2 13:35:42 2022  -Reordering columns to    : SNPID,rsID,CHR,POS,EA,NEA,EAF,BETA,SE,P,N,DIRECTION,STATUS
Wed Nov  2 13:35:42 2022 Finished sorting columns successfully!
Out[17]:
<gwaslab.Sumstats.Sumstats at 0x7fee8400ff10>

Check the data again. Looks good!

In [18]:
Copied!
mysumstats.data
mysumstats.data
Out[18]:
SNPID rsID CHR POS EA NEA EAF BETA SE P N DIRECTION STATUS
0 1:854168:C:T rs79188446 1 854168 T C 0.1484 -0.0292 0.0156 0.06173 166718 -?-+ 1960010
1 1:856510:C:T rs191530247 1 856510 T C 0.0045 0.0604 0.1278 0.63650 166718 +?+- 1960010
2 1:981131:A:G rs9697293 1 981131 G A 0.0330 0.0183 0.0273 0.50200 166718 -?+- 1960000
3 1:1021408:G:A rs11260587 1 1021408 A G 0.0010 -0.1388 0.2623 0.59680 166718 -?+- 1960010
4 1:1038997:ACTC:A rs71576598 1 1038997 A ACTC 0.2037 0.0162 0.0111 0.14410 191764 +++- 1960364
... ... ... ... ... ... ... ... ... ... ... ... ... ...
99995 X:154429214:A:G rs4893075 23 154429214 G A 0.1802 0.0063 0.0096 0.51160 191764 +-+- 1960000
99996 X:154452019:G:A rs113007852 23 154452019 A G 0.0143 0.0374 0.0437 0.39210 191764 +-++ 1960010
99997 X:154682147:C:A rs5940511 23 154682147 A C 0.7852 -0.0081 0.0084 0.33460 191764 -+-+ 1960010
99998 X:154740552:T:C rs5940461 23 154740552 C T 0.7972 -0.0086 0.0090 0.33840 191764 -+-+ 1960000
99999 X:154803312:C:T rs781981622 23 154803312 T C 0.6074 -0.0080 0.0115 0.48580 191764 -+-+ 1960010

100000 rows × 13 columns

Check the summary of the currrent sumstats (see: https://cloufield.github.io/gwaslab/StatusCode/):

In [19]:
Copied!
mysumstats.summary()
mysumstats.summary()
Out[19]:
Values Percentage
Category Items
META Row_num 100000 NaN
Column_num 6 NaN
Column_names SNPID,rsID,EAF,P,N,STATUS NaN
Last_checked_time Wed Nov 2 13:35:42 2022 NaN
MISSING Missing_total 618 0.62
Missing_rsID 618 0.62
MAF Common 50567 50.57
Low_frequency 17297 17.30
Rare 32073 32.07
P Minimum 6.973e-74 0.00
Significant 73 0.07
Suggestive 140 0.14
STATUS 1960010 42746 42.75
1960000 34016 34.02
1960011 6324 6.32
1960001 6196 6.20
1960364 4531 4.53
1960363 3538 3.54
1960007 629 0.63
1960017 622 0.62
1960309 529 0.53
1960368 300 0.30
1960008 189 0.19
1960319 181 0.18
1960018 175 0.18
1960012 15 0.02
1960002 9 0.01

Formatting and saving : to_format()¶

You can easily format the processed sumstats and save it.

For details see: https://cloufield.github.io/gwaslab/Format/

In [20]:
Copied!
mysumstats.to_format("clean_sumstats",fmt="ldsc")
mysumstats.to_format("clean_sumstats",fmt="ldsc")
Wed Nov  2 13:35:42 2022 Start to format the output sumstats in:  ldsc  format
Wed Nov  2 13:35:42 2022  -Formatting statistics ...
Wed Nov  2 13:35:42 2022  - Float statistics formats:
Wed Nov  2 13:35:42 2022   - Columns: ['EAF', 'BETA', 'SE', 'P']
Wed Nov  2 13:35:42 2022   - Output formats: ['{:.4g}', '{:.4f}', '{:.4f}', '{:.4e}']
Wed Nov  2 13:35:42 2022  - Start outputting sumstats in ldsc format...
Wed Nov  2 13:35:42 2022  -ldsc format will be loaded...
Wed Nov  2 13:35:42 2022  -ldsc format meta info:
Wed Nov  2 13:35:42 2022   - format_name  :  ldsc
Wed Nov  2 13:35:42 2022   - format_source  :  https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format
Wed Nov  2 13:35:42 2022   - format_source2  :  https://github.com/bulik/ldsc/blob/master/munge_sumstats.py
Wed Nov  2 13:35:42 2022   - format_version  :  20150306
Wed Nov  2 13:35:42 2022  -gwaslab to ldsc format dictionary:
Wed Nov  2 13:35:42 2022   - gwaslab keys: ['rsID', 'NEA', 'EA', 'N', 'BETA', 'P', 'INFO', 'OR']
Wed Nov  2 13:35:42 2022   - ldsc values: ['SNP', 'A2', 'A1', 'N', 'Beta', 'P', 'INFO', 'OR']
Wed Nov  2 13:35:42 2022  -Output columns: Index(['SNP', 'A1', 'A2', 'Beta', 'P', 'N'], dtype='object')
Wed Nov  2 13:35:42 2022  -Output path: clean_sumstats.ldsc.tsv.gz
Wed Nov  2 13:35:44 2022  -Saving log file: clean_sumstats.ldsc.log
Wed Nov  2 13:35:44 2022 Finished outputting successfully!

Liftover¶

If the sumstats need liftover:

In [21]:
Copied!
mysumstats.liftover(n_cores=1,from_build="19", to_build="38")
mysumstats.liftover(n_cores=1,from_build="19", to_build="38")
Wed Nov  2 13:49:15 2022 Start to perform liftover...
Wed Nov  2 13:49:15 2022  -Current Dataframe shape : 100000  x  13
Wed Nov  2 13:49:15 2022  -CPU Cores to use : 1
Wed Nov  2 13:49:15 2022  -Performing liftover ...
Wed Nov  2 13:49:15 2022  -Creating converter : hg19 to hg38
Wed Nov  2 13:49:16 2022  -Converting variants with status code xxx0xxx :100000...
Wed Nov  2 13:49:50 2022  -Removed unmapped variants: 53
Wed Nov  2 13:49:52 2022 Finished liftover successfully!
In [22]:
Copied!
mysumstats.data
mysumstats.data
Out[22]:
SNPID rsID CHR POS EA NEA EAF BETA SE P N DIRECTION STATUS
0 1:854168:C:T rs79188446 1 918788 T C 0.1484 -0.0292 0.0156 0.06173 166718 -?-+ 3860099
1 1:856510:C:T rs191530247 1 921130 T C 0.0045 0.0604 0.1278 0.63650 166718 +?+- 3860099
2 1:981131:A:G rs9697293 1 1045751 G A 0.0330 0.0183 0.0273 0.50200 166718 -?+- 3860099
3 1:1021408:G:A rs11260587 1 1086028 A G 0.0010 -0.1388 0.2623 0.59680 166718 -?+- 3860099
4 1:1038997:ACTC:A rs71576598 1 1103617 A ACTC 0.2037 0.0162 0.0111 0.14410 191764 +++- 3860399
... ... ... ... ... ... ... ... ... ... ... ... ... ...
99995 X:154429214:A:G rs4893075 23 155200937 G A 0.1802 0.0063 0.0096 0.51160 191764 +-+- 3860099
99996 X:154452019:G:A rs113007852 23 155223738 A G 0.0143 0.0374 0.0437 0.39210 191764 +-++ 3860099
99997 X:154682147:C:A rs5940511 23 155452486 A C 0.7852 -0.0081 0.0084 0.33460 191764 -+-+ 3860099
99998 X:154740552:T:C rs5940461 23 155510891 C T 0.7972 -0.0086 0.0090 0.33840 191764 -+-+ 3860099
99999 X:154803312:C:T rs781981622 23 155573651 T C 0.6074 -0.0080 0.0115 0.48580 191764 -+-+ 3860099

99947 rows × 13 columns

Note: Gwaslab only liftover CHR and POS, and when lifted, the last two digits status code will be rolled back to 99. Since for difference reference genome, the reference allele or strand might be reverse, so it is need to align and check agin.

For details, see https://cloufield.github.io/gwaslab/LiftOver/


GWASLab is licensed under the MIT license
Documentation built with MkDocs.

Keyboard Shortcuts

Keys Action
? Open this help
n Next page
p Previous page
s Search